
LABORATORY FOR COMPUTER SCIENCE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

MIT/LCS/TR-495



THE SPECTRAL NORM OF 
FINITE FUNCTIONS 



Mihir Bellare 



February 1991 



545 TECHNOLOGY SQUARE, CAMBRIDGE, MASSACHUSETTS 02139 



This document was written for my area exam. The assigned papers were [FJS], [KM] and [SB].



The Spectral Norm of Finite Functions 

Mihir Bellare 



Laboratory for Computer Science 
Massachusetts Institute of Technology 



February 1, 1991 



Abstract 



In many recent results in learning and computational complexity theory which rely on Fourier 
analysis, the spectral norm plays a key role. An understanding of this quantity would appear to be 
useful in both gauging and exploiting these results, and in understanding the underlying techniques. 

This paper surveys various aspects of the spectral norm of finite functions. We consider some 
of the motivating results as well as both upper and lower bounds on the spectral norm and their 
relationships to the computational complexity of the function. 

Some of the results included here are new. In particular, we introduce a general technique for 
upper bounding the spectral norm of a decision tree over an arbitrary basis. We also extend the 
learning algorithm of Kushilevitz and Mansour [KM] to mutually independent distributions. 



Keywords: Fourier series, spectral norm, learning, decision trees. 



Contents 

1 Introduction

2 Basics

3 Two Motivating Results
  3.1 Efficient Learnability and the Spectral Norm
  3.2 Sparse Representations and the Spectral Norm

4 Upper Bounds: Functions of Small Norm

5 The Spectral Norm of Decision Trees
  5.1 A General Framework
  5.2 Some Applications: Decision Trees of Small Norm

6 Lower Bounds: Functions of Large Norm
  6.1 General Functions
  6.2 Boolean Functions
  6.3 An Exponential Lower Bound for Majority

7 Generalization: q-norms
  7.1 The q-Framework
  7.2 Efficient Learnability and the q-Norm
  7.3 The q-norm of Decision Trees
  7.4 L_q versus L

A Appendix: Proof of the Combinatorial Identity



1 Introduction 

Most functions are hard to learn efficiently. Functions of small spectral norm, say Kushilevitz and 
Mansour [KM], are an exception: they can be learned in polynomial time. 

Moreover, these functions can be approximated by sparse Fourier series [SB]. They are also computable by threshold circuits of very small depth [Br],[HMPST],[BS],[SB]. And so on: the list of their virtues is a long one.

Small is indeed beautiful. And that is just one reason to be interested in spectral norms. 

Fourier Series and the Spectral Norm. The Fourier series of a function f : \{0,1\}^n \to R is obtained by expressing it as a linear combination of certain appropriately chosen basis functions: f = \sum_{z \in \{0,1\}^n} \hat{f}(z) \chi_z, where the \chi_z : \{0,1\}^n \to \{-1,+1\} are the basis functions and the \hat{f}(z) are real numbers depending on f. The spectral norm of f, denoted L(f), is the sum of the absolute values of the coefficients in this linear combination: L(f) = \sum_{z \in \{0,1\}^n} |\hat{f}(z)|.

Fourier series have of late proved to be a very useful tool in learning and computational complexity 
theory. And as the above indicates, the spectral norm often plays a pivotal role. 

It is not just that small is beautiful. If one examines these results more closely, one realizes that 
the spectral norm is a kind of "yardstick," its value for a given function measuring the "complexity" 
of this function with respect to the task at hand. For example, what the [KM] learning result really 
says is that a function is learnable in time polynomial in its spectral norm. The sparse approximation 
result [SB] really says that one can approximate a function by a Fourier series having a number of 
terms proportional to the spectral norm of the function. And so on. 

The value of the spectral norm of a function emerges as a key to many of its computational 
properties. In order to exploit results such as those listed above, it becomes important to know more 
about this value. In particular, which functions have small spectral norm? Which don't? How does 
this relate to their complexity in various models of computation? These are the kinds of questions on 
which I will focus. 

This Report. I begin by describing a couple of the motivating results mentioned above. 

Next I look at upper bounds on the spectral norm, where what one is really interested in, of course, 
is to identify the functions of small norm. Here I present two very simple examples. 

I then look at the spectral norm of decision trees, an interesting case of how the spectral norm of 
a function can relate to its complexity in some model of computation. 

In §6 I look at lower bounds, presenting two examples of functions whose spectral norm is close to 
the highest possible value for any boolean function. 

Finally, I turn to a generalization of the usual spectral norm which I call the q-norm. Results about the q-norm being much scarcer in the literature (it is a recent invention of Furst, Jackson and Smith [FJS]), I will spend more time here on developing techniques and applications. I conclude by considering briefly the relative merits of the two norms.

What's New. Among the results and proofs included here, some are new. The principal amongst 
these are 

• A technique for upper bounding the spectral norm of a decision tree over an arbitrary basis. 

I introduce a general framework which enables one to compute, in a simple manner, a weight 



which for any decision tree is an upper bound on its spectral norm. This extends the result of [KM] 
which applied only to decision trees over the parity basis. As a consequence, many new classes of 
decision trees become learnable in polynomial time. 

• An extension of the [KM] learning algorithm to mutually independent distributions. 

I generalize the proof of [KM] to show that their algorithm works to learn functions in time polynomial in their q-norm when the error of a hypothesis is measured by its proximity to the target under the mutually independent distribution q.

• A new proof of an exponential (and almost optimal) lower bound on the spectral norm of the 
majority function. 

This is only new in the sense that I have not seen any other proof. (I know, however, that the result is known; I have just been unable to track down the reference.)

Apologia. It being the first time I have looked at Fourier analysis, I have not hesitated to state 
the obvious. In order to tie results together and set the stage, I have often had to prove small things 
which are not in the papers only, I'm sure, because they are considered too basic. 

I include my proofs anyway. To disambiguate, I have followed the convention of citing in the 
theorem statement the source from which I took it: if nothing is cited it means I couldn't find it 
anywhere. 

Finally, I must apologize for the many things I have omitted. It seemed better to stick to a single 
thread than to try to compile a compendium of results in the assigned papers, and of course much 
that is interesting is lost. 

2 Basics 

The set of real valued functions on \{0,1\}^n is a 2^n-dimensional vector space over the reals. Choosing the "right" basis for this vector space yields the Fourier series.

Here I'll go through the basics, as briefly as possible. I'll give the definitions and a minimum of 
the rich set of properties of Fourier series, developing more as the need arises in later sections. 

Notation and Conventions. I identify a string z \in \{0,1\}^n with the subset of \{1,\dots,n\} whose characteristic vector is z. With this convention, I will apply set-theoretic notation to strings, using expressions like "i \in z" or "y \Delta z" where y, z are n-bit strings and 1 \le i \le n.

The empty string is denoted \lambda. We adopt the (usual) convention that \{0,1\}^0 = \{\lambda\}.

Boolean functions for us mean functions whose range is \{-1,+1\}.

Fourier Series. We define the inner product of f, g : \{0,1\}^n \to R by

    \langle f, g \rangle = 2^{-n} \sum_{x \in \{0,1\}^n} f(x) g(x).

Note that \langle f, g \rangle = E[fg] where the expectation is over the uniform distribution on the inputs. The norm associated to this inner product is \|f\| = \sqrt{\langle f, f \rangle}.

The parity functions are the functions \chi_z : \{0,1\}^n \to \{-1,+1\} defined for z \in \{0,1\}^n by

    \chi_z(x) = \prod_{i \in z} (-1)^{x_i}.



\{\chi_z\}_{z \in \{0,1\}^n} is an orthonormal basis for the vector space of real valued functions on \{0,1\}^n (the orthonormality is with respect to the inner product defined above).

Suppose f : \{0,1\}^n \to R. Then f = \sum_{z \in \{0,1\}^n} \hat{f}(z) \chi_z where \hat{f}(z) = \langle f, \chi_z \rangle. This (unique) expansion of f in terms of the parity basis is its Fourier series. The Fourier coefficients of f are the real numbers \hat{f}(z) (z \in \{0,1\}^n). The sequence of Fourier coefficients is also called the spectrum of f. The Fourier transform is the operator F defined by F(f) = \hat{f}. This operator is linear: F(af + g) = aF(f) + F(g) for real a and functions f, g : \{0,1\}^n \to R.

The spectral norm (or just norm) of f is L(f) = \sum_{z \in \{0,1\}^n} |\hat{f}(z)|. In the context of norms, "small" means polynomially bounded (as a function of n) and "large" means not small.

Parseval's identity

    \sum_{z \in \{0,1\}^n} \hat{f}(z)^2 = \|f\|^2

follows directly from the orthonormality of the basis. It follows that if f is boolean then \sum_{z \in \{0,1\}^n} \hat{f}(z)^2 = 1. A consequence of this last fact (via the Cauchy-Schwarz inequality) is the following

Proposition 2.1 If f is boolean then L(f) \le 2^{n/2}.

A useful property of the parity functions is that \chi_a \chi_b = \chi_{a \Delta b}.

Some Intuition for the Boolean Case. Some intuition about what the Fourier coefficients of a boolean function f measure is provided by Linial, Mansour and Nisan [LMN]. They observe that \hat{f}(z) = Pr[f(x) = \chi_z(x)] - Pr[f(x) \ne \chi_z(x)] (the probabilities being over a random choice of the input x). Thus one can think of the Fourier coefficient \hat{f}(z) as measuring the ability of f to approximate the parity of the bits in z when the input is chosen at random.

Matrix Formulation. The following formulation of the Fourier transform in terms of matrices is 
often useful. 

The Sylvester type Hadamard matrices are defined inductively by H_0 = [1] and

    H_{n+1} = [ H_n   H_n  ]
              [ H_n  -H_n ].

Let [f] denote the vector (of length 2^n) whose components are the values of f(x) in lexicographic order. Then

Proposition 2.2 [Br] Let f : \{0,1\}^n \to R. Then [\hat{f}] = 2^{-n} H_n [f].
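To make the matrix formulation concrete, here is a small brute-force sketch in Python (my own illustration, not from the report; the toy function f is an arbitrary choice). It builds H_n by the Sylvester recursion, computes the spectrum as 2^{-n} H_n [f], and checks Parseval's identity for a boolean f.

    from itertools import product

    def hadamard(n):
        # Sylvester recursion: H_{n+1} = [[H, H], [H, -H]], starting from H_0 = [1]
        H = [[1]]
        for _ in range(n):
            H = [row + row for row in H] + [row + [-v for v in row] for row in H]
        return H

    n = 3
    f = lambda x: (-1) ** (x[0] & x[1])              # a toy boolean function
    fv = [f(x) for x in product((0, 1), repeat=n)]   # [f] in lexicographic order
    fhat = [sum(h * v for h, v in zip(row, fv)) / 2 ** n for row in hadamard(n)]
    L = sum(abs(c) for c in fhat)                    # spectral norm L(f)
    print(fhat, L)
    assert abs(sum(c * c for c in fhat) - 1) < 1e-9  # Parseval for boolean f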

3 Two Motivating Results 

Fourier analysis has of late proved to be a very useful tool in learning and computational complexity 
theory. And in many recent results, the spectral norm plays a pivotal role. Here are two examples. 

3.1 Efficient Learnability and the Spectral Norm 

Learning is hard. How hard? The spectral norm of the target function, say Kushilevitz and Mansour 
[KM], gives us some indication. Let us see how. 

The Model. The learning algorithm A receives as input 1^n, \epsilon, \delta and, as an oracle, the function f : \{0,1\}^n \to \{-1,+1\} which it is trying to learn (the latter means that as part of its normal operation the algorithm can query the value of f on an input x and receive the answer in one step). The output of A on these inputs is (an encoding of) a hypothesis h : \{0,1\}^n \to R. The error of the hypothesis with respect to the input f is Err_h(f) = Pr[f(x) \ne h(x)] (the probability is over a random choice of x). We can now say what it means to learn a concept class.

A_B(1^n, \epsilon, \delta; f)
    S \gets Coef(1^n, \lambda, \epsilon B(n)^{-1}; f)
    h(\cdot) \gets sign(\sum_{z \in S} \hat{f}(z) \chi_z)
    return h

Coef(1^n, \alpha, \theta; f)
    { Returns a superset of { \alpha\beta : |\hat{f}(\alpha\beta)| \ge \theta } }
    if \alpha \in \{0,1\}^n then return \{\alpha\}
    else if E[f_\alpha^2] \ge \theta^2 then return Coef(1^n, \alpha 0, \theta; f) \cup Coef(1^n, \alpha 1, \theta; f)
    else return \emptyset

Figure 1: The Learning Algorithm A_B

Definition 3.1 For each n let C^n be a collection of functions from \{0,1\}^n \to \{-1,+1\}. We say that the concept class C = \cup_{n \ge 1} C^n is learnable if there exists an algorithm A (of the kind described above) such that Pr[Err_{A(1^n,\epsilon,\delta;f)}(f) \le \epsilon] \ge 1 - \delta for every f \in C^n and every n and \epsilon, \delta > 0 (the probability is over the coin tosses of the algorithm).

Note that this is not distribution free learning in the sense of Valiant [Va]. Not only is the algorithm allowed membership queries but also the error of the hypothesis is measured with respect to the uniform distribution on the inputs.

The Result. Let C_B = \cup_{n \ge 1} C_B^n where C_B^n is the class of functions from \{0,1\}^n \to \{-1,+1\} whose spectral norm is bounded above by B(n).

Theorem 3.2 [KM] Let B : N \to N. Then the concept class C_B is learnable in time polynomial in B(n), n, \epsilon^{-1} and \log \delta^{-1}.

Broadly speaking, this says that a concept class is learnable in time polynomial in a bound on the spectral norm of the functions in the class.

The Algorithm. The learning algorithm is presented in Figure 1. Let me explain, very briefly, what 
is happening. 

The idea is to approximate f by the sum of a small number of "special" terms of its Fourier series. These special terms are those which have a high Fourier coefficient: specifically, one can show that if S \subseteq \{0,1\}^n includes all strings z satisfying |\hat{f}(z)| \ge \epsilon B(n)^{-1} then \sum_{z \in S} \hat{f}(z) \chi_z is a good approximation to f.

Call z special if |\hat{f}(z)| \ge \theta. The main component of the algorithm is a subroutine Coef which finds all the special terms of f: Coef(1^n, \lambda, \theta; f) returns a set of n-bit strings containing all the special ones (\lambda is the empty string).

Coef works recursively, building up the special strings from the empty string: if \alpha \in \{0,1\}^k then Coef(1^n, \alpha, \theta; f) returns (a superset of) the set of all special strings which have prefix \alpha. The routine looks at the function f_\alpha : \{0,1\}^{n-k} \to R defined by f_\alpha(x) = \sum_{\beta \in \{0,1\}^{n-k}} \hat{f}(\alpha\beta) \chi_\beta(x): the value of E[f_\alpha^2] is used to prune the search and keep the running time polynomial.

Finally, let me remark that the algorithm as stated is somewhat inaccurate in that it assumes we can compute both E[f_\alpha^2] and \hat{f}(z) exactly. The truth, however, is that both can be sufficiently well approximated for our purposes.
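To make the recursion concrete, here is a brute-force Python sketch of Coef (my own; it computes \hat{f} and E[f_\alpha^2] exactly by enumeration and so runs in exponential time; the point of [KM] is that sampling makes these quantities cheap to approximate, but the pruning logic is the same).

    from itertools import product

    def chi(z, x):
        # parity basis function: chi_z(x) = prod_{i in z} (-1)^{x_i}
        return (-1) ** sum(zi & xi for zi, xi in zip(z, x))

    def fhat(f, z, n):
        # Fourier coefficient: 2^{-n} sum_x f(x) chi_z(x)
        return sum(f(x) * chi(z, x) for x in product((0, 1), repeat=n)) / 2 ** n

    def coef(f, n, alpha, theta):
        """Return a superset of {alpha+beta : |fhat(alpha+beta)| >= theta}."""
        k = len(alpha)
        if k == n:
            return [alpha]
        # E[f_alpha^2] = sum_beta fhat(alpha+beta)^2, by Parseval for f_alpha
        e2 = sum(fhat(f, alpha + beta, n) ** 2
                 for beta in product((0, 1), repeat=n - k))
        if e2 >= theta ** 2:
            return (coef(f, n, alpha + (0,), theta)
                    + coef(f, n, alpha + (1,), theta))
        return []

    n = 3
    f = lambda x: chi((1, 1, 0), x)   # f is itself a parity, so fhat(110) = 1
    print(coef(f, n, (), 0.5))        # a superset: [(1, 1, 0), (1, 1, 1)]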

Proofs? They await §7.2 where I will show something more general. The correctness of the above 
will follow as a special case. 

3.2 Sparse Representations and the Spectral Norm 

One of the drawbacks of the Fourier series, from a computational point of view, is that it is just too 
large: the series, after all, has an exponential number of terms. So we seek to approximate functions 
by sparser Fourier series. But how sparse a series can one get? The answer, again, is in the spectral 
norm. The result of Siu and Bruck [SB] that I will describe here shows the existence of approximations 
to a function f by a Fourier series having a number of terms which is polynomial in the spectral norm of f.

Applications abound, particularly in the area of threshold circuit complexity [Br],[HMPST],[BS],[SB].

The Result. The theorem of Siu and Bruck [SB] improves on work in [BS]. The proof I will present 
here is a slightly modified version of the one in [SB]: I make the probabilistic method more explicit 
and use Chernoff bounds to avoid the variance calculation. 

Lemma 3.3 (Chernoff Bound) Let X_1, \dots, X_m be independent, mean 0 random variables in the range [-1,1] and let X = \sum_{i=1}^m X_i. Then

    Pr[|X| \ge A] \le 2e^{-A^2/2m}

for any A > 0.

Note: Siu and Bruck [SB] assume the domain of their functions is \{-1,+1\}^n, rather than \{0,1\}^n as I assume here. So the statement of the following theorem will look a little different from theirs. It is however easy to see that the two formulations are equivalent.

Theorem 3.4 [SB] Let f : \{0,1\}^n \to \{-1,+1\} and let k > 0. Then there exists a set S \subseteq \{0,1\}^n and integers M, w_z (z \in S) such that |f(x) - g(x)| \le n^{-k} for all x, where g : \{0,1\}^n \to R is defined by g(x) = \frac{1}{M} \sum_{z \in S} w_z \chi_z(x). Moreover |S|, M are \le O(n^{2k+1} L(f)^2) and w_z \in \{-1,+1\} (z \in S).

Proof: The proof uses the probabilistic method. Define \Omega_i = \{0,1\}^n \cup \{\perp\} with

    Pr_{\Omega_i}(z) = |a_z|/B                              if z \in \{0,1\}^n
                       1 - \sum_{a \in \{0,1\}^n} |a_a|/B   otherwise,

where B = \lceil L(f) \rceil and a_z = \hat{f}(z). Our sample space will be \Omega = \prod_{i=1}^N \Omega_i with the induced product probability measure. Now define random variables

    Z_{x,i}(z^1, \dots, z^N) = sign(a_{z^i}) \chi_{z^i}(x)  if z^i \in \{0,1\}^n
                               0                            otherwise

and let G_x = \frac{B}{N} \sum_{i=1}^N Z_{x,i}. Then

    Pr[|G_x - f(x)| > n^{-k}] = Pr[ |\sum_{i=1}^N (Z_{x,i} - f(x)/B)| > N n^{-k}/B ]

and noting that E[Z_{x,i}] = f(x)/B this is \le 2e^{-N n^{-2k}/8B^2} by Lemma 3.3. Choosing N = 16 B^2 n^{2k+1} this is < 2^{-n} and thus

    Pr[ |G_x - f(x)| \le n^{-k} for all x \in \{0,1\}^n ] > 0.

Thus there is a point (z^1, \dots, z^N) in our sample space such that |G_x(z^1, \dots, z^N) - f(x)| \le n^{-k} for all x \in \{0,1\}^n and we can conclude the proof by setting g = \frac{B}{N} \sum_{i=1}^N sign(a_{z^i}) \chi_{z^i} (actually the sum is only over the i for which z^i is in \{0,1\}^n). ■

The reader is referred to [SB] for applications to the construction of small depth threshold circuits. 
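As a small illustration of the proof technique, the following Python sketch (mine; the toy f, the value N = 500, and the fixed seed are illustrative choices, not the constants of the theorem) samples basis functions with probability proportional to |\hat{f}(z)| and averages their signs.

    import random
    from itertools import product

    def chi(z, x):
        return (-1) ** sum(zi & xi for zi, xi in zip(z, x))

    def spectrum(f, n):
        xs = list(product((0, 1), repeat=n))
        return {z: sum(f(x) * chi(z, x) for x in xs) / 2 ** n for z in xs}

    n, N = 4, 500
    f = lambda x: (-1) ** (x[0] & x[1])        # a toy boolean function
    coeffs = spectrum(f, n)
    zs = [z for z, a in coeffs.items() if a != 0]
    B = sum(abs(coeffs[z]) for z in zs)        # B = L(f) (here 2)
    rng = random.Random(0)
    sample = rng.choices(zs, weights=[abs(coeffs[z]) for z in zs], k=N)
    g = lambda x: B / N * sum((1 if coeffs[z] > 0 else -1) * chi(z, x)
                              for z in sample)
    err = max(abs(f(x) - g(x)) for x in product((0, 1), repeat=n))
    print(f"B = {B}, max |f - g| = {err:.3f}")  # small with high probability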

4 Upper Bounds: Functions of Small Norm 

Convinced that functions of small norm are nice functions, the first thing we would like to do is to find examples of them. In this section I present two simple examples.

AND and OR. For any z \in \{0,1\}^n we define AND_z, OR_z : \{0,1\}^n \to \{-1,+1\} by

    AND_z(x) = (-1)^{\wedge_{i \in z} x_i}
    OR_z(x) = (-1)^{\vee_{i \in z} x_i}.

Proposition 4.1 L(AND_z), L(OR_z) \le 3.

Proof: It is not too hard to compute explicitly the complete spectrum of these functions. Details 
omitted. ■ 
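Since the details are omitted, here is a quick brute-force check in Python (my own; n = 4 with z = 1111 is an arbitrary choice). For this case both norms come out to 2.75, consistent with the bound of 3.

    from itertools import product

    def chi(z, x):
        return (-1) ** sum(zi & xi for zi, xi in zip(z, x))

    def spectral_norm(f, n):
        xs = list(product((0, 1), repeat=n))
        return sum(abs(sum(f(x) * chi(z, x) for x in xs)) / 2 ** n for z in xs)

    n = 4
    AND = lambda x: (-1) ** min(x)   # (-1)^(AND of all bits), since AND = min
    OR = lambda x: (-1) ** max(x)    # (-1)^(OR of all bits), since OR = max
    print(spectral_norm(AND, n), spectral_norm(OR, n))   # 2.75 2.75, both <= 3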

Comparison. Sometimes it is possible to show that a function has small norm by exploiting some 
intrinsic property of it (for example, a convenient inductive structure). The comparison function is an 
example [SB]. 

The comparison function C_n : \{0,1\}^n \to \{-1,+1\} is defined for even n by

    C_n(xy) = 1 if x \ge y, and -1 otherwise,

where x, y \in \{0,1\}^{n/2} are identified with the integers \sum_{i=1}^{n/2} x_i 2^{n/2-i} and \sum_{i=1}^{n/2} y_i 2^{n/2-i} respectively.

Proposition 4.2 [SB] Let n be even. Then L(C_n) = \frac{n}{2} + 1.

Proof: The proof is by induction on k = n/2. For the base case k = 1 it is easily verified that C_2 = \frac{1}{2}\chi_{00} + \frac{1}{2}\chi_{01} - \frac{1}{2}\chi_{10} + \frac{1}{2}\chi_{11} and hence L(C_2) = 2. For k \ge 1 we note that

    C_{2k+2}(xy) = \frac{(-1)^{y_1} - (-1)^{x_1}}{2} + \frac{1 + (-1)^{x_1 + y_1}}{2} \cdot C_{2k}(x_2 \dots x_{k+1} y_2 \dots y_{k+1}).

It follows that L(C_{2k+2}) = L(C_{2k}) + 1. ■
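A brute-force confirmation (mine) of Proposition 4.2 for the smallest case beyond the base:

    from itertools import product

    def chi(z, x):
        return (-1) ** sum(zi & xi for zi, xi in zip(z, x))

    def C(x):
        # compare the two halves as integers, most significant bit first;
        # lexicographic tuple comparison does exactly this
        h = len(x) // 2
        return 1 if x[:h] >= x[h:] else -1

    n = 4
    xs = list(product((0, 1), repeat=n))
    L = sum(abs(sum(C(x) * chi(z, x) for x in xs)) / 2 ** n for z in xs)
    print(L)   # 3.0 = n/2 + 1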



5 The Spectral Norm of Decision Trees 

A particularly interesting aspect of spectral norms is how they can relate to the computational complexity of the function in some model of computation. The first (to my knowledge) example is the result of [KM] on parity decision trees.

Here I present a general framework for upper bounding the spectral norm of an arbitrary decision 
tree in terms of the spectral norms of the functions in its nodes. Extending the result of [KM], this 
yields many examples of classes of decision trees of small norm. 

5.1 A General Framework 

The Model. A decision tree is a binary tree in which each node is labeled by a boolean function \{0,1\}^n \to \{-1,+1\}. The tree computes (or defines) a function f : \{0,1\}^n \to \{-1,+1\} in the natural way. On any input x \in \{0,1\}^n, we start at the root and compute the value at x of the function g by which the root is labeled. We branch according to this value: to the left if it was -1 and to the right if it was +1. We continue in this way until we hit a leaf. Here we compute the value at x of the function g by which this leaf is labeled, and this value is the value of the function f on input x.

Note that we do not for now restrict the functions labeling the nodes: any boolean function is 
allowed. 

I will identify the decision tree with the function it computes. Thus I will speak of the Fourier transform or the spectral norm of a decision tree, and write things like \hat{T}(z) or L(T) correspondingly.
Similarly I identify a node with the function which labels it and speak of the Fourier transform or 
norm of a node. 

Finally, |T| will denote the number of nodes in T. 

Degrees. We assign to each function a number which we will call its degree. The degrees of the constituent nodes will enable us to measure the spectral norm of a decision tree. The definition itself is very simple: we let deg(g) = \frac{1}{2}[L(g) + 1]. So the degrees are, strictly speaking, the spectral norms themselves: it is just more convenient to translate slightly.

Note that L(g) \ge 1 for any boolean g by Proposition 6.1. So 1 is also the minimal degree of any boolean function.

Weight of a Decision Tree. We will assign to each decision tree a weight which is computed as a function of the spectral norms (or, more precisely, the degrees) of its constituent nodes. The weight w(T) of a decision tree T is defined by induction on the structure of the binary tree T as follows:

• If |T| = 1 then we let w(T) = L(g) where g is the function labeling the (single) node which comprises the tree.

• If |T| \ge 3 then T is as depicted in Figure 2 and we let w(T) = deg(g)[w(T_{-1}) + w(T_1)].

Figure 2: The decision tree T: a root labeled g, whose left subtree is T_{-1} and whose right subtree is T_1.

Useful Facts. Before we can prove the main theorem we need to develop some facts about the 
spectral norm. 

Lemma 5.1 L(f\chi_s) = L(f) for any function f.

Proof: We have

    \widehat{f\chi_s}(z) = 2^{-n} \sum_x f(x)\chi_s(x)\chi_z(x) = 2^{-n} \sum_x f(x)\chi_{s \Delta z}(x) = \hat{f}(s \Delta z)

and the lemma follows from the fact that the map z \mapsto s \Delta z is a permutation of \{0,1\}^n. ■

Lemma 5.2 Suppose f, g : \{0,1\}^n \to R. Then

(1) L(f + g) \le L(f) + L(g)

(2) L(fg) \le L(f)L(g).

Proof: By the linearity of the Fourier transform we have F(f + g)(z) = F(f)(z) + F(g)(z). Apply the triangle inequality and then sum over all z to get (1).

Let a_z = \hat{g}(z). Then making use of (1) we have

    L(fg) = L(f \cdot \sum_z a_z \chi_z) = L(\sum_z a_z f \chi_z) \le \sum_z L(a_z f \chi_z) = \sum_z |a_z| L(f \chi_z).

But by Lemma 5.1 this equals

    \sum_z |a_z| L(f) = L(f) \sum_z |a_z| = L(f)L(g)

which establishes (2). ■



Main Theorem. The main theorem is now very simply stated. It says that the spectral norm of a decision tree is at most its weight.

Theorem 5.3 Let T be a decision tree. Then L(T) \le w(T).

Proof: The proof is by induction on the structure of the tree. The base case of T having only one node is easily verified. Now suppose T has \ge 3 nodes. Visualize T as depicted in Figure 2. Now observe that

    T(x) = \frac{1}{2}[1 - g(x)]T_{-1}(x) + \frac{1}{2}[1 + g(x)]T_1(x)
         = \frac{1}{2}T_{-1}(x) + \frac{1}{2}T_1(x) - \frac{1}{2}T_{-1}(x)g(x) + \frac{1}{2}T_1(x)g(x).

By Lemma 5.2 we get

    L(T) \le \frac{1}{2}L(T_{-1}) + \frac{1}{2}L(T_1) + \frac{1}{2}L(T_{-1})L(g) + \frac{1}{2}L(T_1)L(g)
          = \frac{1}{2}[L(g) + 1]L(T_{-1}) + \frac{1}{2}[L(g) + 1]L(T_1)
          = deg(g)[L(T_{-1}) + L(T_1)].

But by induction L(T_{-1}) \le w(T_{-1}) and L(T_1) \le w(T_1) so the above is \le deg(g)[w(T_{-1}) + w(T_1)] = w(T). ■
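Here is a small numeric sanity check of the main theorem (my own toy construction; the tree and its node functions are arbitrary choices). It computes L(T) by brute force and compares it with w(T).

    from itertools import product

    def chi(z, x):
        return (-1) ** sum(zi & xi for zi, xi in zip(z, x))

    def spectral_norm(f, n):
        xs = list(product((0, 1), repeat=n))
        return sum(abs(sum(f(x) * chi(z, x) for x in xs)) / 2 ** n for z in xs)

    # A tree is either a leaf function, or a triple (g, left, right): we
    # branch on g(x), going left on -1 and right on +1.
    def evaluate(tree, x):
        if callable(tree):
            return tree(x)
        g, left, right = tree
        return evaluate(left if g(x) == -1 else right, x)

    def weight(tree, n):
        if callable(tree):
            return spectral_norm(tree, n)        # w(leaf) = L(leaf)
        g, left, right = tree
        deg = (spectral_norm(g, n) + 1) / 2      # deg(g) = (L(g) + 1) / 2
        return deg * (weight(left, n) + weight(right, n))

    n = 3
    g1 = lambda x: chi((1, 1, 0), x)             # parity nodes have degree 1
    g2 = lambda x: chi((0, 0, 1), x)
    T = (g1, lambda x: -1, (g2, lambda x: 1, lambda x: -1))
    print(spectral_norm(lambda x: evaluate(T, x), n), "<=", weight(T, n))
    # prints 2.0 <= 3.0, consistent with Theorem 5.3 (and Corollary 5.4)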



5.2 Some Applications: Decision Trees of Small Norm 

Usually, we are interested in decision trees all of whose nodes are from some small basis of functions. 
For example, [KM] consider decision trees where the nodes all compute parity functions. In many of 
these special cases, the weight of the tree is easy to compute. Theorem 5.3 can then be applied to 
obtain a useful upper bound on the spectral norm of the tree.

This technique is quite general, and one can use it to bound the spectral norm of trees with all 
kinds of combinations of functions in the nodes. Here I illustrate with just two examples. 

Parity Decision Trees. Let us begin by deriving the result of [KM] that the spectral norm of a parity decision tree is bounded above by the number of nodes in the tree (a parity decision tree is a decision tree each of whose internal nodes is labeled by a parity function \chi_s and each of whose leaves is labeled +1 or -1).

Corollary 5.4 [KM] Let T be a parity decision tree. Then L(T) \le \lceil |T|/2 \rceil.

Proof: By the main theorem it suffices to show that the weight of a parity decision tree T is \le \lceil |T|/2 \rceil. The important observation for this is that deg(\chi_s) = 1. The fact that w(T) \le \lceil |T|/2 \rceil is now easily established by induction using the definition of w. ■

Characterization of Degree 1 Functions. We note that the only fact about the parity functions that we used in the above is that deg(\chi_s) = 1. Thus the theorem extends to decision trees each of whose nodes is labeled by a function of degree 1. As the following characterization of degree 1 functions shows, however, these decision trees are actually identical to parity decision trees.

Proposition 5.5 Suppose g is boolean. Then deg(g) = 1 if and only if L(g) = 1 if and only if g = \pm\chi_s for some s.

Proof: It is clear that deg(g) = 1 if and only if L(g) = 1, and it is also clear that L(g) = 1 if g = \pm\chi_s. To complete the proof it suffices to show that if L(g) = 1 then g = \pm\chi_s for some s. So suppose L(g) = 1. Then we have

    1 = |g(x)| = |\sum_z \hat{g}(z)\chi_z(x)| \le \sum_z |\hat{g}(z)| = L(g) = 1

and thus |\sum_z \hat{g}(z)\chi_z(x)| = \sum_z |\hat{g}(z)\chi_z(x)|. It follows that for each x either \hat{g}(z)\chi_z(x) \ge 0 for all z or \hat{g}(z)\chi_z(x) \le 0 for all z. Moreover the former happens when g(x) = 1 and the latter when g(x) = -1, because g(x) = \sum_z \hat{g}(z)\chi_z(x). So fixing any z such that \hat{g}(z) \ne 0 (at least one such must exist) we can conclude that sign(\hat{g}(z)) = \chi_z(x) if g(x) = 1 and sign(\hat{g}(z)) = -\chi_z(x) if g(x) = -1. But this is true for all x, and thus g is either \chi_z or -\chi_z. ■

So the parity functions and their negations are exactly the boolean functions of minimal degree. 

Computing the Weight. Before proceeding to more applications let us derive a more useful formulation of the weight function. Namely,

    w(T) = \sum_{\ell \text{ a leaf}} L(\ell) \prod_{i \text{ an ancestor of } \ell} deg(i).

That is, the weight is the sum, over all leaf to root paths, of the product of the degrees along these paths, each path weighted by the norm of its leaf (we are identifying a node with the function labeling it, so that deg(i) means the degree of the function labeling node i; for a leaf labeled \pm 1 we have L(\ell) = 1).



AND and OR. We looked at the spectral norm of the AND and OR functions in §4, and by Proposition 4.1 we know that deg(AND_z), deg(OR_z) \le 2. So these functions are just a step higher than parity in the degree hierarchy, and the natural choice of an addition to the basis to get a richer class of decision trees.

Corollary 5.6 Let T be a decision tree over the basis \{\chi_s, AND_s, OR_s : s \in \{0,1\}^n\} with the property that any root to leaf path has at most O(\log n) AND_s or OR_s nodes. Then L(T) \le |T| \cdot n^{O(1)}.

Proof: The weight is easily bounded using the above expression. ■

I don't have space for more examples, but these should have sufficed to illustrate quite well how the main theorem can be applied.

6 Lower Bounds: Functions of Large Norm 

I now turn to lower bounds on the spectral norm. I'll first (briefly) consider arbitrary real valued 
functions and then move on to the real case of interest: boolean functions. 

6.1 General Functions 

Let's begin with the following observation. 

Proposition 6.1 L(f) \ge \max_x |f(x)| for any function f : \{0,1\}^n \to R.

Proof: For any x it is the case that

    L(f) = \sum_z |\hat{f}(z)| = \sum_z |\hat{f}(z)\chi_z(x)| \ge |\sum_z \hat{f}(z)\chi_z(x)| = |f(x)|. ■

This means that for a function to have polynomially bounded spectral norm, its output must be at most O(\log n) bits long. Many simple and interesting functions are thus ruled out. Addition of two n-bit numbers is one such.

One quickly realizes however, that the above is not really a restriction. In most applications 
(learning is an example) the output can be viewed bit by bit. That is, we view the function as a 
concatenation of boolean functions, and consider the spectral norm of each of these boolean functions 
separately. So we should concentrate on boolean functions. 

6.2 Boolean Functions 

It is probably not surprising that 

Theorem 6.2 Most boolean functions have large spectral norm. 

We can view this as a corollary of the fact that functions of small norm are efficiently learnable 
(Theorem 3.2). A direct proof is not as obvious as it might seem (and I haven't seen one). 

Can we construct a specific example of a boolean function of large norm? The answer is yes. The function is the "inner product mode 2" (it is "mode," not "mod") defined for even n by

    I_n(x) = \prod_{i=1}^{n/2} (-1)^{x_{2i-1} \wedge x_{2i}}.






Theorem 6.3 [MS],[Br] Let n be even. Then L(I_n) = 2^{n/2}.

Proof: This proof uses the matrix representation of Fourier transforms that we introduced in §2.

We claim that [I_n] is an eigenvector of H_n with eigenvalue 2^{n/2}. Proposition 2.2 then implies that [\hat{I}_n] = 2^{-n/2}[I_n], and the result follows. The fact that [I_n] is an eigenvector of H_n with eigenvalue 2^{n/2} is easily established by induction on k = n/2. ■

This proof is from [Br] who however cites [MS] for the result.

We note that by Proposition 2.1 the bound of Theorem 6.3 is tight.
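A brute-force confirmation (mine; n = 4 is an arbitrary small case) of Theorem 6.3:

    from itertools import product

    def chi(z, x):
        return (-1) ** sum(zi & xi for zi, xi in zip(z, x))

    def ip(x):
        # I_n(x) = prod_{i=1}^{n/2} (-1)^(x_{2i-1} AND x_{2i})
        return (-1) ** sum(x[2 * i] & x[2 * i + 1] for i in range(len(x) // 2))

    n = 4
    xs = list(product((0, 1), repeat=n))
    L = sum(abs(sum(ip(x) * chi(z, x) for x in xs)) / 2 ** n for z in xs)
    print(L)   # 4.0 = 2^{n/2}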

A few other simple examples of functions of exponential spectral norm are known [Br],[BS]. The 
proofs in all cases are based on inductive ideas like the ones above (and indeed the functions are 
constructed to have enough inductive structure to make these proofs go through). I'll now consider a 
somewhat different kind of boolean function: majority. 

6.3 An Exponential Lower Bound for Majority 

The majority function M_n : \{0,1\}^n \to \{-1,+1\} is defined by

    M_n(x) = 1 if \sum_{i=1}^n x_i \ge n/2, and -1 otherwise.

What's Known. Siu and Bruck [SB] show that the spectral norm of M_n is not polynomially bounded. Their proof relies on Theorem 3.4 and results from [Br] and [HMPST]. The results from [Br] and [HMPST] are first used to show that M_n is not approximable (to within o(n^{-1}) error) by any sparse Fourier series. That L(M_n) is not polynomially bounded then follows from Theorem 3.4.

Assuming this was the best known ([SB] did not indicate otherwise) I proved the exponential lower bound which appears as Theorem 6.4 below. Later I read [BOH] in which, in a different context, appear equations for the complete spectrum of the majority function. A lower bound at least as good as Theorem 6.4 is easily derived from this.

They attribute the spectrum to [Ka]. I was however unable to find [Ka] anywhere, and thus do not know the proof of the result stated in [BOH].

I should note, though, that [SB] and [BOH] all consider only the case of n being odd (they are interested in majority as a linear threshold function and the case of n even is inconvenient because \sum_i x_i could be 0, a tie). My proof applies (only) to n even.

Before going on it is also worth noting that techniques along the lines of those used in the proof of 
Theorem 6.3 would not appear to be useful here because the majority function has no clear inductive 
structure. 

The Lower Bound. The following exponential lower bound is within a polynomial factor of optimal:

Theorem 6.4 Let n \ge 1 be a multiple of 4. Then

    L(M_n) \ge \frac{4 e^{-1/n}}{\sqrt{2}\,\pi n} \cdot 2^{n/2}.



Let's proceed to the proof. 






The attack is pretty direct. What I will do is compute explicitly the "middle" Fourier coefficients (that is, those corresponding to the parity over a size n/2 subset of the variables), and get a bound based on these alone.

A Combinatorial Identity. I will need the fact that

    \sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i = (-1)^k \binom{2k}{k}.

This is probably known, although I haven't been able to find a reference. For completeness, I provide in Appendix A a simple proof via generating functions.

Computing the Middle Fourier Coefficients. The following lemma gives explicit values for the middle Fourier coefficients of the majority function.

Lemma 6.5 Let z \in \{0,1\}^{4k} have exactly 2k ones. Then

    \langle M_{4k}, \chi_z \rangle = 2^{-4k} \binom{2k}{k} (-1)^k.

Proof: We begin by observing that the value of the Fourier coefficient \langle M_{4k}, \chi_z \rangle depends only on the number of ones in z. So without loss of generality we can set z = 1^{2k} 0^{2k}. Now let A = 2^{4k} \langle M_{4k}, \chi_z \rangle. We claim that

    A = \sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i,    (1)

whence the result follows by our combinatorial identity.

To establish Equation (1) we begin by noting that M_{4k}(x)\chi_z(x) = (-1)^i M_{4k}(1^i 0^{2k-i} 1^j 0^{2k-j}) where i is the number of ones in the left half of x and j is the number of ones in the right half of x. It follows that

    A = \sum_{i=0}^{2k} \sum_{j=0}^{2k} \binom{2k}{i} \binom{2k}{j} (-1)^i M_{4k}(1^i 0^{2k-i} 1^j 0^{2k-j})
      = \sum_{i=0}^{2k} \binom{2k}{i} (-1)^i \left[ \sum_{j=2k-i}^{2k} \binom{2k}{j} - \sum_{j=0}^{2k-i-1} \binom{2k}{j} \right].

The i = k term of the above sum equals \binom{2k}{k}^2 (-1)^k. Let B be the remaining part of the sum. Noting that \binom{2k}{i}(-1)^i = \binom{2k}{2k-i}(-1)^{2k-i} we can pair the terms for i and 2k - i and rewrite B as

    B = \sum_{i=0}^{k-1} \binom{2k}{i} (-1)^i \left[ \sum_{j=2k-i}^{2k} \binom{2k}{j} - \sum_{j=0}^{2k-i-1} \binom{2k}{j} + \sum_{j=i}^{2k} \binom{2k}{j} - \sum_{j=0}^{i-1} \binom{2k}{j} \right].

Using the fact that \binom{2k}{j} = \binom{2k}{2k-j} the term in square brackets simplifies to

    \sum_{j=i}^{2k-i} \binom{2k}{j} - \sum_{j=i+1}^{2k-i-1} \binom{2k}{j} = \binom{2k}{i} + \binom{2k}{2k-i} = 2\binom{2k}{i}

so that B = 2 \sum_{i=0}^{k-1} \binom{2k}{i}^2 (-1)^i, and putting all this together we get

    A = \binom{2k}{k}^2 (-1)^k + 2 \sum_{i=0}^{k-1} \binom{2k}{i}^2 (-1)^i = \sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i

(using \binom{2k}{i} = \binom{2k}{2k-i} once more) as desired. ■

Proof of Theorem 6.4. The proof of Theorem 6.4 follows easily now. First let us recall the bound

    \binom{2n}{n} \ge \frac{e^{-1/6n}}{\sqrt{\pi n}} \cdot 2^{2n}    (2)

(which is obtained using Stirling's formula). Now let k = n/4. Then L(M_{4k}) = \sum_z |\langle M_{4k}, \chi_z \rangle| \ge \sum_{|z|=2k} |\langle M_{4k}, \chi_z \rangle|, and this by Lemma 6.5 and Equation (2) is at least

    \binom{4k}{2k} \cdot 2^{-4k} \binom{2k}{k} \ge \frac{e^{-1/12k}}{\sqrt{2\pi k}} 2^{4k} \cdot 2^{-4k} \cdot \frac{e^{-1/6k}}{\sqrt{\pi k}} 2^{2k} = \frac{e^{-1/4k}}{\sqrt{2}\,\pi k} \cdot 2^{2k} = \frac{4 e^{-1/n}}{\sqrt{2}\,\pi n} \cdot 2^{n/2}

as desired.
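A quick numeric check of Lemma 6.5 (my own, for k = 1): every z \in \{0,1\}^4 with exactly two ones should have coefficient 2^{-4} \binom{2}{1} (-1)^1 = -1/8.

    from itertools import product
    from math import comb

    def chi(z, x):
        return (-1) ** sum(zi & xi for zi, xi in zip(z, x))

    def maj(x):
        # majority with ties (sum = n/2) mapping to +1, as defined above
        return 1 if sum(x) >= len(x) // 2 else -1

    k = 1
    n = 4 * k
    xs = list(product((0, 1), repeat=n))
    expected = (-1) ** k * comb(2 * k, k) / 2 ** n   # -1/8 for k = 1
    for z in xs:
        if sum(z) == n // 2:
            coeff = sum(maj(x) * chi(z, x) for x in xs) / 2 ** n
            assert abs(coeff - expected) < 1e-9
    print("Lemma 6.5 checks for n = 4")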



7 Generalization: q-norms

I now turn to a generalization of the basic Fourier series and the corresponding spectral norms (q-norms). Results about the q-norm being much scarcer in the literature (it is a recent invention of Furst, Jackson and Smith [FJS]), I will spend more time here on developing techniques and applications. In particular, I present a new application.

The motivation, once again, comes from learning.

One of the drawbacks of the [KM] learning algorithm is that the quality of the hypothesis is measured by its proximity to the target under one particular fixed distribution: the uniform one. Can we move any closer to some kind of distribution free learning?

History, luckily, provides a parallel. Linial, Mansour and Nisan [LMN] had used Fourier analysis to develop an algorithm to learn AC^0 functions under the uniform distribution. Furst, Jackson and Smith extended this result to what they called mutually independent distributions. In order to do this they developed a more general Fourier transform in which the basis is defined in terms of the mutually independent distribution q.

The tools and framework introduced by [FJS] provide an avenue to similarly generalize the [KM] learning algorithm. Measuring the quality of a hypothesis by its proximity to the target under q, we will be able to learn in time polynomial in the q-norm.

Here I will present this generalization of [KM] and then try to say something about q-norms and their relation to the usual norm.

7.1 The q-Framework

I must begin by developing the machinery. Most of the work, in fact, is here: once things have been 
properly set up the lemmas of [KM] can be generalized in a pretty straightforward manner. 

Mutually Independent Distributions. A probability distribution q : \{0,1\}^n \to [0,1] is mutually independent if the random variables x_1, \dots, x_n are independent.

We will be assuming that 0 < Pr[x_i = 1] < 1 for each i. We say that q is polynomially bounded if

    \max_{1 \le i \le n} \left[ \frac{1}{Pr[x_i = 1]} + \frac{1}{Pr[x_i = 0]} \right]

is polynomially bounded (as a function of n).

q-Fourier Series. Let q be mutually independent. We define the inner product of f, g : \{0,1\}^n \to R by

    \langle f, g \rangle_q = \sum_{x \in \{0,1\}^n} q(x) f(x) g(x).

Note that \langle f, g \rangle_q = E[fg] where the expectation is over q.

The norm associated to this inner product is \|f\|_q = \sqrt{\langle f, f \rangle_q}.

Let \mu_i = Pr[x_i = 1]. Note that \mu_i is the mean of the random variable x_i. Let the standard deviation \sqrt{\mu_i(1-\mu_i)} of x_i be denoted by \sigma_i.

For z, x \in \{0,1\}^n we define

    \phi(z, x) = \prod_{i \in z} \frac{\mu_i - x_i}{\sigma_i}.

We call \phi the basis associated to q because of the following

Proposition 7.1 [Ba],[FJS] Let q : \{0,1\}^n \to [0,1] be mutually independent and let \phi be as defined above. Then \{\phi(z,\cdot)\}_{z \in \{0,1\}^n} is an orthonormal basis for the vector space of real valued functions on \{0,1\}^n (the orthonormality is with respect to the inner product \langle \cdot, \cdot \rangle_q).

Note that \phi(z, \cdot) = \chi_z when q is the uniform distribution.

To get a feel for these definitions, let's check the orthonormality. For a, b \in \{0,1\}^n we have

    \langle \phi(a,\cdot), \phi(b,\cdot) \rangle_q = E\left[ \prod_{i \in a} \frac{\mu_i - x_i}{\sigma_i} \cdot \prod_{j \in b} \frac{\mu_j - x_j}{\sigma_j} \right]

and the mutual independence implies that this equals

    \prod_{i \in a \cap b} E\left[ \left( \frac{\mu_i - x_i}{\sigma_i} \right)^2 \right] \cdot \prod_{i \in a \Delta b} E\left[ \frac{\mu_i - x_i}{\sigma_i} \right].

Now each term of the first product is 1 while each term of the second is 0, and thus the above expression is 1 if a = b and 0 otherwise.
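The following Python sketch (mine; the bias vector mu is an arbitrary choice) repeats this orthonormality check numerically for n = 3.

    from itertools import product
    from math import sqrt, isclose

    mu = [0.2, 0.5, 0.8]                    # mu_i = Pr[x_i = 1], my choices
    sigma = [sqrt(m * (1 - m)) for m in mu]
    n = len(mu)
    xs = list(product((0, 1), repeat=n))

    def q(x):                               # q(x) = prod_i Pr[x_i]
        p = 1.0
        for m, xi in zip(mu, x):
            p *= m if xi == 1 else 1 - m
        return p

    def phi(z, x):                          # prod_{i in z} (mu_i - x_i)/sigma_i
        p = 1.0
        for i in range(n):
            if z[i] == 1:
                p *= (mu[i] - x[i]) / sigma[i]
        return p

    for a in xs:
        for b in xs:
            ip = sum(q(x) * phi(a, x) * phi(b, x) for x in xs)
            assert isclose(ip, 1.0 if a == b else 0.0, abs_tol=1e-9)
    print("phi is orthonormal under q")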

The Fourier series of f : \{0,1\}^n \to R is \sum_{z \in \{0,1\}^n} \hat{f}(z) \phi(z, x) where \hat{f}(z) = \langle f, \phi(z,\cdot) \rangle_q. The q-spectral norm (or just q-norm) is L_q(f) = \sum_{z \in \{0,1\}^n} |\hat{f}(z)|. Parseval's identity as usual says that \sum_{z \in \{0,1\}^n} \hat{f}(z)^2 = \|f\|_q^2.

I'm abusing notation a little by denoting by \hat{f}(z) both the Fourier coefficients under q and the (usual) Fourier coefficients with respect to \{\chi_z\}. The context should usually make it clear which I mean, and if both are around I will disambiguate.

The Shifted Distribution and Basis. With minor changes the above is the framework of [FJS]. I now have to go a little further.

Let q be mutually independent and \phi the associated basis. Suppose 1 \le k \le n.

For y \in \{0,1\}^k and 0 \le j \le n - k we define q_j(y) = Pr[x_{j+1} = y_1, \dots, x_{j+k} = y_k]. Observe that q_j is a mutually independent distribution on \{0,1\}^k.

For notational convenience we identify q_0 with q.

For x, z \in \{0,1\}^k and 0 \le j \le n - k we define

    \phi_j(z, x) = \prod_{i \in z} \frac{\mu_{j+i} - x_i}{\sigma_{j+i}}.

We also define \phi_j(\lambda, \lambda) = 1 for all j = 0, \dots, n.

For notational convenience we identify \phi_0 with \phi.

The fact that q_j is a mutually independent distribution on \{0,1\}^k implies, by Proposition 7.1, that \{\phi_j(z,\cdot)\}_{z \in \{0,1\}^k} is an orthonormal basis for the vector space of real valued functions on \{0,1\}^k (the orthonormality is with respect to the inner product \langle \cdot, \cdot \rangle_{q_j}).

We'll denote by E_j the expectation over q_j.

A Switching Lemma. One very useful property of the \phi-basis is that

    \sum_{c \in \{0,1\}^k} \phi_j(c, a) \phi_j(c, b) = \frac{1}{q_j(b)} \langle \phi_j(a,\cdot), \phi_j(b,\cdot) \rangle_{q_j}

for any a, b \in \{0,1\}^k and any j = 0, \dots, n - k: summing over the first argument "switches" into the inner product over the second, which by orthonormality is 1 if a = b and 0 otherwise.

7.2 Efficient Learnability and the q-Norm

Fix a mutually independent distribution q. The only change in the learning model is that the probability in the definition of the error Err_h(f) = Pr[f(x) \ne h(x)] is now taken over q (rather than over x chosen uniformly at random).

The Result. Let C_B(q) = \cup_{n \ge 1} C_B^n(q) where C_B^n(q) is the class of functions from \{0,1\}^n \to \{-1,+1\} whose spectral q-norm is bounded above by B(n).

Theorem 7.2 Let B : N \to N. Let q be a polynomially bounded, mutually independent distribution. Then the concept class C_B(q) is learnable in time polynomial in B(n), n, \epsilon^{-1} and \log \delta^{-1}.

Note that when q is the uniform distribution this reduces to the [KM] result.

The algorithm is shown in Figure 3. Again, one can check that when q is the uniform distribution it is just the algorithm of Figure 1.

To show that it works, I now generalize the three basic lemmas of [KM]. I'll state and prove the 
lemmas without elaborating overly; the discussion in §3.1 should suffice to see why these lemmas imply 
the correctness of the algorithm, and for even more detail the reader is referred to [KM]. 

The Lemmas. Fix a mutually independent distribution q and its associated basis \phi. Let f : \{0,1\}^n \to R (note we are not requiring f to be boolean). Suppose 0 \le k \le n - 1 and \alpha \in \{0,1\}^k. Define f_\alpha : \{0,1\}^{n-k} \to R by f_\alpha(x) = \sum_{\beta \in \{0,1\}^{n-k}} \hat{f}(\alpha\beta) \phi_k(\beta, x).

The first lemma shows that it suffices to find the terms of sufficiently large Fourier coefficient. The 
proof remains identical to the one in [KM]. 






A_{B,q}(1^n, \epsilon, \delta; f)
    S \gets Coef(1^n, \lambda, \epsilon B(n)^{-1}; f)
    h(\cdot) \gets sign(\sum_{z \in S} \hat{f}(z) \phi(z, \cdot))
    return h

Coef(1^n, \alpha, \theta; f)
    { Returns a superset of { \alpha\beta : |\hat{f}(\alpha\beta)| \ge \theta } }
    if \alpha \in \{0,1\}^n then return \{\alpha\}
    else let k be the length of \alpha
        if E_k[f_\alpha^2] \ge \theta^2 then return Coef(1^n, \alpha 0, \theta; f) \cup Coef(1^n, \alpha 1, \theta; f)
        else return \emptyset

Figure 3: The Generalized Learning Algorithm A_{B,q}

Lemma 7.3 Let \epsilon > 0 and suppose S \supseteq \{z : |\hat{f}(z)| \ge \epsilon L_q(f)^{-1}\}. Let g(x) = \sum_{z \in S} \hat{f}(z) \phi(z, x). Then E[(f - g)^2] \le \epsilon. Moreover if f is boolean then Pr[f(x) \ne sign(g(x))] \le \epsilon.

Proof:

    E[(f - g)^2] = \sum_{z \in \{0,1\}^n} [\hat{f}(z) - \hat{g}(z)]^2 = \sum_{z \notin S} \hat{f}(z)^2,

and the last of these is at most

    \max_{z \notin S} |\hat{f}(z)| \cdot \sum_{z \in \{0,1\}^n} |\hat{f}(z)| \le \epsilon L_q(f)^{-1} \cdot L_q(f) = \epsilon.

For boolean f we note that Pr[f(x) \ne sign(g(x))] \le Pr[|f(x) - g(x)| \ge 1] \le E[(f - g)^2]. ■

The next lemma is used to bound the search time. I state it slightly more generally than [KM] so that the case of non-boolean functions is covered.

Lemma 7.4 At most a \|f\|_q^2 \theta^{-2} fraction of the functions f_\alpha (\alpha \in \{0,1\}^k) have E_k[f_\alpha^2] \ge \theta^2.

Proof: We observe that

    \sum_\alpha E_k[f_\alpha^2] = \sum_\alpha \langle f_\alpha, f_\alpha \rangle_{q_k} = \sum_\alpha \langle \sum_\beta \hat{f}(\alpha\beta)\phi_k(\beta,\cdot), \sum_{\beta'} \hat{f}(\alpha\beta')\phi_k(\beta',\cdot) \rangle_{q_k}

which by the bilinearity of the inner product and the orthonormality of the basis is

    \sum_{\alpha\beta} \hat{f}(\alpha\beta)^2 = \|f\|_q^2.

The lemma follows. ■

Note that \|f\|_q \le L_q(f) and hence if L_q(f) is polynomially bounded then so is \|f\|_q.

Finally we have to show how to approximate E_k[f_\alpha^2]. It can be done by sampling provided we can approximate f_\alpha itself. By equating f_\alpha with an expectation the following lemma provides the avenue to obtain the necessary approximation to it.

Lemma 7.5 f_\alpha(x) = E[f(\cdot x)\phi(\alpha, \cdot)] (the expectation being over the first k bits, distributed according to q_0).

Proof: Making liberal use of the identities in §7.1 we have

    f_\alpha(x) = \sum_\beta \hat{f}(\alpha\beta) \phi_k(\beta, x)
                = \sum_{yz} f(yz) \phi_0(\alpha, y) q(yz) \sum_\beta \phi_k(\beta, z) \phi_k(\beta, x)
                = \sum_{yz} f(yz) \phi_0(\alpha, y) q_0(y) \frac{q_k(z)}{q_k(x)} \langle \phi_k(z,\cdot), \phi_k(x,\cdot) \rangle_{q_k}
                = \sum_y f(yx) \phi_0(\alpha, y) q_0(y)
                = E[f(\cdot x)\phi(\alpha, \cdot)]. ■

Learning Non-Boolean Functions. The restriction that the target function be boolean is not really necessary. For the general case we change the definition of the error to Err_h(f) = E[(f - h)^2] (the expectation, as usual, being over q). We also require the target functions to have polynomially bounded output length under some encoding of the output into bits. Finally the algorithm of Figure 3 is modified to output the function g(\cdot) = \sum_{z \in S} \hat{f}(z) \phi(z, \cdot) rather than just its sign.

7.3 The q-norm of Decision Trees

The results of §5 can also be generalized by appropriately extending the definition of the degree.

Namely, let q be mutually independent and g : \{0,1\}^n \to R. Then the degree of g with respect to q, denoted deg_q(g), is defined to be the least positive real number d with the property that L_q(fg) \le (2d - 1)L_q(f) for all f : \{0,1\}^n \to \{-1,+1\}.

It should be noted that when q is the uniform distribution this definition of the degree coincides with the one in §5. However I suspect that in general it is not true that deg_q(g) = \frac{1}{2}[L_q(g) + 1].

The weight w_q(T) of a decision tree T is then defined by induction as before, using the new definition of the degree, and the proof of the main theorem can be extended to show that L_q(T) \le w_q(T) for any decision tree T. I omit the details.

7.4 L_q versus L

Results such as the above motivate us to understand the relationship between the usual spectral norm and the q-spectral norm. In particular, can the change of basis and distribution drastically reduce the spectral norm?

Proposition 7.6 There is a function of exponential spectral norm which has polynomial spectral q-norm with respect to some polynomially bounded, mutually independent q.

Proof: Let q be defined by Pr[x_i = 1] = 3/4 for each i = 1, \dots, n. Let \phi be the associated basis. Then L(\phi(z,\cdot)) \ge 3^{|z|/2} (this follows from Proposition 6.1 since |\phi(z, 0^n)| = 3^{|z|/2}). So for z having, say, a constant fraction of ones, the (usual) spectral norm of \phi(z,\cdot) is exponential.

The q-norm of \phi(z,\cdot), on the other hand, is of course just 1 since \phi(z,\cdot) is a basis function. ■

So \phi(z,\cdot) is not learnable by the original algorithm but is learned by the generalized one.
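A numeric illustration of the proof (my own; n = 6 is an arbitrary choice). For z = 1^n the lower bound of Proposition 6.1 happens to be attained exactly here:

    from itertools import product
    from math import sqrt

    n = 6
    mu, sigma = 0.75, sqrt(0.75 * 0.25)   # Pr[x_i = 1] = 3/4 for every i
    xs = list(product((0, 1), repeat=n))

    def chi(z, x):
        return (-1) ** sum(zi & xi for zi, xi in zip(z, x))

    def phi_ones(x):                      # phi(z, x) for z = 1^n
        p = 1.0
        for xi in x:
            p *= (mu - xi) / sigma
        return p

    L = sum(abs(sum(phi_ones(x) * chi(w, x) for x in xs)) / 2 ** n for w in xs)
    print(L, ">=", 3 ** (n / 2))          # both come out to 27 here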

Problem. Find an example of a boolean function f and a polynomially bounded, mutually independent q such that L(f) is not polynomially bounded but L_q(f) is.

I don't know how hard this is; I haven't had time to try.






Acknowledgements 

Thanks to Ron Rivest for suggesting (and writing) a lisp routine for evaluating \sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i; I wouldn't have got the answer without it!

References 

[Ba] Bahadur, R., "A Representation of the Joint Distribution of Responses to n Dichotomous Items," Studies in Item Analysis and Prediction (edited by H. Solomon), Stanford University Press (1961).

[BOH] Brandman, Y., A. Orlitsky and J. Hennessy, "A Spectral Lower Bound Technique for the Size of Decision Trees and Two-Level AND/OR Circuits," IEEE Transactions on Computers 39(2), 282-287 (February 1990).

[Br] Bruck, J., "Harmonic Analysis of Polynomial Threshold Functions," SIAM J. Discrete Math. 3(2), 168-177 (1990).

[BS] Bruck, J. and R. Smolensky, "Polynomial Threshold Functions, AC^0 Functions, and Spectral Norms," Proceedings of the 31st Annual IEEE Symposium on the Foundations of Computer Science, IEEE (1990).

[FJS] Furst, M., J. Jackson and S. Smith, "Learning AC^0 Functions Sampled under Mutually Independent Distributions," Manuscript (October 1990).

[HMPST] Hajnal, A., W. Maass, P. Pudlak, M. Szegedy and G. Turan, "Threshold Circuits of Bounded Depth," Proceedings of the 28th Annual IEEE Symposium on the Foundations of Computer Science, IEEE (1987).

[Ka] Karpovsky, M., "Finite Orthogonal Series in the Design of Digital Devices," Wiley (1976).

[KM] Kushilevitz, E. and Y. Mansour, "Learning Decision Trees using the Fourier Spectrum," Manuscript (November 1990).

[LMN] Linial, N., Y. Mansour and N. Nisan, "Constant Depth Circuits, Fourier Transform, and Learnability," Manuscript (January 1991). Preliminary version in Proceedings of the 30th Annual IEEE Symposium on the Foundations of Computer Science, IEEE (1989).

[MS] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting Codes," North-Holland (1978).

[SB] Siu, K. and J. Bruck, "On the Power of Threshold Circuits with Small Weights," Manuscript.

[Va] Valiant, L., "A Theory of the Learnable," Communications of the ACM 27(11), 1134-1142 (1984).






A Appendix: Proof of the Combinatorial Identity 

Here I give a proof of the combinatorial identity used to derive the lower bound on the spectral norm of the majority function. Let's begin by recalling the identity:

    \sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i = (-1)^k \binom{2k}{k}.

The proof uses generating functions. We define a_i = \binom{2k}{i} and b_i = \binom{2k}{i}(-1)^i and let

    G(x) = \left( \sum_{i=0}^{2k} a_i x^i \right) \left( \sum_{i=0}^{2k} b_i x^i \right).

Let \sum_{i \ge 0} c_i x^i be the formal power series for G. By definition of G we have

    c_{2k} = \sum_{i=0}^{2k} a_i b_{2k-i} = \sum_{i=0}^{2k} \binom{2k}{i} \binom{2k}{2k-i} (-1)^{2k-i} = \sum_{i=0}^{2k} \binom{2k}{i}^2 (-1)^i.

So it suffices to show that c_{2k} = (-1)^k \binom{2k}{k}. To do this we compute G(x) explicitly. By the binomial theorem \sum_i \binom{2k}{i} x^i = (1+x)^{2k} and \sum_i \binom{2k}{i} (-1)^i x^i = (1-x)^{2k}, so

    G(x) = (1+x)^{2k} (1-x)^{2k} = (1-x^2)^{2k}.

Applying the binomial theorem again we get

    G(x) = \sum_{i=0}^{2k} \binom{2k}{i} (-1)^i x^{2i}

and thus c_{2k} = (-1)^k \binom{2k}{k} as desired.
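For the skeptical reader, a quick numeric check of the identity (a Python stand-in, mine, for the lisp routine mentioned in the acknowledgements):

    from math import comb

    for k in range(1, 10):
        lhs = sum((-1) ** i * comb(2 * k, i) ** 2 for i in range(2 * k + 1))
        assert lhs == (-1) ** k * comb(2 * k, k)
    print("identity verified for k = 1..9")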


